This is how I intend to study for the midterms: by checking against the learning objectives. I should be able to understand each concept and fulfill its learning objective.
The first-level checkbox is the learning objective, which is checked if I understand it. The second-level bullet points are supplementary information that I have added. Content in bold is content that I may be unsure of.
Learning outcomes
- Introduction
  - Machine Learning
  - Learning Frameworks
  - Pipeline
  - Decision Theory
  - Optimization
- Regression
  - Model Selection
  - Regularization
- Classification
  - Logistic Regression
  - Softmax Regression
- Neural Networks
  - Feedforward Networks
  - Backpropagation
  - Feature Engineering
  - Convolutional Networks
  - Recurrent Neural Networks
- Kernel Methods
  - Maximum Margins
  - Duality
  - Support Vector Machines
  - Kernels
  - Kernelization
- Graphical Models
  - Probability
  - Graph Theory
  - Bayesian Networks (Directed Graphical Models)
  - Markov Random Fields (Undirected Graphical Models)
Define machine learning in terms of algorithms, tasks, performance and experience.
State that the goal of machine learning is generalization.
List three main types of machine learning, e.g. supervised, unsupervised, and reinforcement learning.
Describe some potential dangers in machine learning, e.g. making unethical predictions due to bias in training data, undesirable feedback between machine and human learning, applying an algorithm without understanding its assumptions.
List the steps in the machine learning pipeline: collect data, extract features, design models, train models, select models, evaluate solution.
Define the roles of training, validation and test data in the pipeline, and describe their structure as sets of pairs of inputs/features and targets.
Define the following: model, estimator, model parameter, training objective, optimal parameter, optimal estimator.
Explain the difference between training and prediction.
Explain the difference between underfitting and overfitting.
Give an example in regression of a task, the performance evaluation (via test data), the given experience (via training data), and the training algorithm.
Define and give examples of action, loss, risk, empirical risk and decision.
Actions - Objects considered in decision making.
Loss - Amount lost by an action based on the truth.
Risk - Expected loss given the true distribution, which is unknown.
Empirical Risk - Estimated risk given data (which can be training data, validation data or test data).
Decision - Action chosen based on data.
Give examples of different kinds of loss functions.
Define the inability to generalize as risk, and explain why we need to estimate this risk with empirical risk, e.g. training, validation, test errors.
Describe how to derive exact solutions or use gradient descent for optimization problems. Describe how local minima problems can be mitigated.
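A minimal sketch of gradient descent, with random restarts as one way to mitigate local minima; the function, step size and restart points are my own made-up example:

```python
import numpy as np

# Gradient descent on f(w) = (w - 3)^2, whose gradient is 2(w - 3).
def gradient_descent(grad, w0, lr=0.1, n_steps=100):
    w = w0
    for _ in range(n_steps):
        w = w - lr * grad(w)      # step in the direction of steepest descent
    return w

w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)

# Mitigating local minima: restart from several initial points
# and keep the best result (trivial here since f is convex).
restarts = [gradient_descent(lambda w: 2 * (w - 3), w0=float(s))
            for s in range(-5, 6)]
```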
Explain why we may prefer descent over exact methods in machine learning.
Outline some computational strategies for machine learning: distributed computation, stochastic gradient descent, automatic differentiation.
Compute gradients automatically using PyTorch.
Example PyTorch training step (assumes `model`, `data`, `target`, `criterion` and `optimizer` are already defined):

```python
optimizer.zero_grad()               # reset accumulated gradients
outModel = model(data)              # forward pass
loss = criterion(outModel, target)  # compute the loss
loss.backward()                     # backpropagate: compute gradients
optimizer.step()                    # update the parameters
```
Define regression statistically as estimating the conditional expectation of an unknown target given the observed inputs.
Outline the statistical methodology in regression: learn the conditional distribution using maximum likelihood; derive the conditional expectation.
Given a conditional distribution and data set, write down the (conditional) likelihood, the negative log-likelihood and derive its gradient.
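A sketch of the negative log-likelihood (up to an additive constant) and its gradient for a Gaussian linear model y ~ N(Xw, 1), with the analytic gradient checked against finite differences; data and names are made up:

```python
import numpy as np

def nll(w, X, y):
    # negative log-likelihood of y ~ N(Xw, 1), dropping constants
    r = y - X @ w
    return 0.5 * r @ r

def nll_grad(w, X, y):
    # gradient of the NLL above
    return -X.T @ (y - X @ w)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)

# Finite-difference check of the analytic gradient.
eps = 1e-6
num = np.array([(nll(w + eps * e, X, y) - nll(w - eps * e, X, y)) / (2 * eps)
                for e in np.eye(3)])
```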
Describe a generalized linear model as a set of conditional distributions (from some exponential family) with a link function. View linear regression as a GLM.
List strategies to overcome overfitting in models with many parameters.
Explain in words the bias-variance tradeoff, and how it affects model selection.
Explain that regularization helps generalization by creating, for model selection, a family of models in which model complexity is penalized to different extents during training. Describe how the hyperparameters are selected.
Describe the difference between the training objective and the training error.
Give examples of regularizers that are commonly used in machine learning.
Explain how the exact solution for ridge regression stabilizes the estimator.
Explain how gradient descent for ridge regression works through shrinkage.
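A sketch showing the shrinkage view of ridge gradient descent: each step multiplies w by (1 - lr*lam) before the data term, and the iterates converge to the exact solution (X^T X + lam I)^{-1} X^T y; data, step size and names are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
lam, lr = 1.0, 0.01

w = np.zeros(4)
for _ in range(5000):
    # shrink the weights, then take a data-fitting step
    w = (1 - lr * lam) * w - lr * X.T @ (X @ w - y)

# Exact ridge solution for comparison.
w_exact = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
```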
Describe the difference between classification and regression.
Given a perceptron, draw its decision boundary and decision regions.
Write down the model in logistic regression, and describe how a classifier can be derived from a conditional distribution in the model after learning.
Write down logistic regression as a generalized linear model. Derive the training objective and the training gradient.
Conditional distribution: Bernoulli, p(y = 1 | x) = sigma(w.x), where sigma(z) = 1 / (1 + e^(-z)) is the sigmoid
Link function: the logit, log(p / (1 - p)) = w.x (its inverse is the sigmoid)
Training objective is to minimise the negative log likelihood
Training gradient is the average of the point gradients (sigma(w.x_i) - y_i) x_i
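A sketch checking that the vectorised logistic-regression gradient equals the average of the point gradients; the synthetic data and names are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
w = rng.normal(size=3)

# Vectorised gradient of the mean negative log-likelihood.
grad = X.T @ (sigmoid(X @ w) - y) / len(y)

# Same thing as an explicit average of per-point gradients.
grad_avg = np.mean([(sigmoid(x @ w) - yi) * x for x, yi in zip(X, y)], axis=0)
```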
Define one-hot encodings for labels in multiclass classification.
Write down the model in softmax regression. Define the softmax function. Explain why the last column of the matrix of parameters may be set to zero.
Write down softmax regression as a generalized linear model
Conditional distribution: categorical, p(y = k | x) = softmax(Wx)_k
Link function: the multinomial logit (its inverse is the softmax)
Training objective is to minimise the negative log likelihood (the cross-entropy)
Training gradient is the average of the point gradients (softmax(Wx_i) - y_i) x_i^T, with y_i one-hot
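A sketch of the softmax function with the usual max-subtraction for numerical stability; the invariance to adding a constant to every logit is also why one column of the parameter matrix can be fixed at zero:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)   # stability trick; does not change the result
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)          # a valid probability vector
```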
Define an artificial neuron. Give examples of activation functions.
Write down forward propagation in matrix notation.
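A sketch of one hidden layer of forward propagation in matrix notation, a = g(W1 x + b1), yhat = W2 a + b2; the layer sizes and ReLU activation are my own choices:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer parameters
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # output layer parameters

x = rng.normal(size=3)
a = relu(W1 @ x + b1)    # hidden activations
yhat = W2 @ a + b2       # network output
```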
Distinguish between network parameters and network architecture.
Give examples of network architectures in supervised learning (e.g. convolutional networks) and in unsupervised learning (e.g. autoencoders).
Define the training loss and the backpropagated error. Derive the gradients in terms of the backpropagated error using chain rule.
Outline the steps of the backpropagation algorithm. Explain how dynamic programming speeds up the computation of the gradient during training.
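A sketch of backpropagation on a tiny two-layer network with squared loss: the backpropagated errors delta2 and delta1 are computed once and reused for each layer's gradient (the dynamic-programming saving), and the result is checked by finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(1, 4))
x, y = rng.normal(size=3), np.array([1.0])

def loss(W1, W2):
    a = np.tanh(W1 @ x)
    return 0.5 * np.sum((W2 @ a - y) ** 2)

# Forward pass: cache activations for reuse in the backward pass.
z1 = W1 @ x
a1 = np.tanh(z1)
yhat = W2 @ a1

# Backward pass: propagate the error, layer by layer.
delta2 = yhat - y                        # error at the output
dW2 = np.outer(delta2, a1)
delta1 = (W2.T @ delta2) * (1 - a1**2)   # error through tanh (chain rule)
dW1 = np.outer(delta1, x)

# Finite-difference check on one entry of W1.
eps = 1e-6
E = np.zeros_like(W1); E[0, 0] = eps
num = (loss(W1 + E, W2) - loss(W1 - E, W2)) / (2 * eps)
```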
Explain why deep learning is more successful today than it was in the 1980s.
Explain the success of deep learning in terms of removing the need for handcrafted features by domain experts.
Give examples of applications of deep learning.
View a deep network with ReLU activations as a piecewise-linear function.
Explain the strengths and the limitations of deep learning using the universal approximation theorem and the no-free-lunch theorem.
Describe how deep learning enables regression/classification with less data by learning lower-dimensional structure in high-dimensional data.
Describe how deep learning advances machine learning (software without explicit programming) by allowing modular blocks to be chained and trained.
Explain how convolution of shared filters and pooling of feature maps solves the dimensionality problem in computer vision.
Compute the tensor convolution of a multichannel feature map by multichannel filters. Define the kernel size, stride and padding in a convolution.
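A sketch of the output-size formula, out = (in + 2*padding - kernel) // stride + 1, plus a naive single-channel valid convolution (cross-correlation, as usual in deep learning); the example inputs are made up:

```python
import numpy as np

def conv_out_size(n, kernel, stride=1, padding=0):
    # spatial output size of a convolution
    return (n + 2 * padding - kernel) // stride + 1

def conv2d(x, k):
    # naive valid cross-correlation, single channel, stride 1
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

x = np.arange(16.0).reshape(4, 4)
k = np.ones((3, 3))
y = conv2d(x, k)
```

With kernel 5, stride 1 and padding 2, a 28x28 input keeps its 28x28 size, which is the usual "same" setup.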
Identify the purpose of different blocks in a given piece of PyTorch code.
Distinguish between feedforward and recurrent networks.
Describe how recurrent networks learn long-term dependencies in sequential data by storing temporal states and accessing them with attention.
Describe attention mechanisms in terms of queries on key-value pairs.
Input for every cell: the current input x and the previous hidden state h.
Output for every cell: the new hidden state.
Update cell state:
- What to forget (a kind of attention) - key matrix W_f, query vector [h, x] - outputs a 0-to-1 vector (sigmoid) - applied elementwise to the previous cell state to forget.
- Selective remember - key matrix W_c, query vector [h, x] - outputs a -1-to-1 vector (tanh) - gated by the input gate, it is added to the cell state.
- Selective output - a 0-to-1 vector (sigmoid) applied to tanh of the new cell state to produce the output.
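A sketch of one LSTM cell step matching the gate descriptions in the notes; all weights are random placeholders and the parameterisation is the standard one, not necessarily the exact one from class:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d = 4                                        # state size (input size also 4)
Wf, Wi, Wc, Wo = (rng.normal(size=(d, 2 * d)) for _ in range(4))

def lstm_step(x, h, c):
    v = np.concatenate([h, x])               # query vector [h, x]
    f = sigmoid(Wf @ v)                      # forget gate, in (0, 1)
    i = sigmoid(Wi @ v)                      # input gate, in (0, 1)
    c_tilde = np.tanh(Wc @ v)                # candidate, in (-1, 1)
    c_new = f * c + i * c_tilde              # forget, then selectively remember
    o = sigmoid(Wo @ v)                      # output gate
    h_new = o * np.tanh(c_new)               # selective output
    return h_new, c_new

h, c = np.zeros(d), np.zeros(d)
h, c = lstm_step(rng.normal(size=d), h, c)
```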
Explain how the problem of finding a classifier with maximal margin widths can be written as a constrained squared-loss optimization problem.
Explain why soft margins are needed for linearly inseparable data, and how it can be written as a constrained optimization problem with slack variables.
Explain how the hinge loss classifier with regularization is the same as the soft margin classifier with slack variables.
Analyze the effects (in terms of margin width and in terms of generalization) of changing the hyperparameters C or lambda on the trained classifier.
Comment: As the two strategies are explained side by side, it was quite hard to follow. Moreover, you are mixing primal-dual (from optimisation) and Lagrangian (from the systems world).
Outline two strategies (dual problem with box constraints, exact solution to KKT conditions) for solving optimization problems with inequality constraints.
Solve a dual optimisation problem where the constraints are nicer and where it is easier to implement gradient descent.
Solve the Lagrangian system of equations.
Write down the primal and dual problems in terms of the Lagrangian. Compare the primal and dual optimal values using the max-min inequality.
Given a constrained optimization problem, derive the dual problem and its constraints, and write down the complementary slackness conditions.
This uses both the primal and dual formulations.
The Lagrangian, the primal inequalities, the dual inequalities and complementary slackness.
Outline how the dual form of a support vector machine may be derived from the primal form with slack variables (but no need to memorize the dual form).
Define support vectors as the feature vectors linearly combined in the optimal w. Recognize that often, there are only a few support vectors.
Describe how the value of the Lagrange multiplier alpha determines the position of a feature vector x in relation to the margin of the classifier.
Write down the formula of the resulting classifier in terms of the support vectors, and its offset in terms of a boundary support vector.
Describe the kernel trick as a strategy to reduce computation during training and prediction by writing them in terms of an easy-to-compute kernel as opposed to a difficult-to-compute feature map. Illustrate this with an example.
Define a kernel as a symmetric function which generates Gram matrices that are positive semidefinite. Relate kernels to similarity maps and distance maps.
k(x, x') = k(x', x) for all x, x'
the Gram matrix with entries K_ij = k(x_i, x_j) is positive semidefinite for all choices of points x_1, ..., x_n
symmetric (previous property)
all eigenvalues of the Gram matrix are real (not imaginary) and nonnegative
Existence of a feature map phi such that k(x, x') = <phi(x), phi(x')>
When k(x, x') is large, x and x' are similar (and the distance between them is small)
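A sketch checking the Gram-matrix properties for the radial basis kernel on random points: the matrix is symmetric and its eigenvalues are real and nonnegative; points and bandwidth are made up:

```python
import numpy as np

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
pts = rng.normal(size=(10, 2))
K = np.array([[rbf(a, b) for b in pts] for a in pts])   # Gram matrix

# eigvalsh exploits symmetry, so the eigenvalues come out real.
eigvals = np.linalg.eigvalsh(K)
```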
Write down the linear, polynomial and radial basis kernels.
Linear kernel: k(x, x') = x . x'
Polynomial kernel: k(x, x') = (x . x' + c)^d
Radial basis kernel: k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
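The three kernels as code; the polynomial offset c, degree d and bandwidth sigma are my own assumed parameterisations:

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, c=1.0, d=2):
    return (x @ z + c) ** d

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 0.0])
z = np.array([1.0, 1.0])
```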
Describe how to use a radial basis kernel in an SVM for training and prediction.
Outline how a representation of w as a linear combination of feature vectors can be used to derive kernel ridge regression.
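A sketch of kernel ridge regression with the linear kernel: writing w = X^T alpha gives alpha = (K + lam I)^{-1} y with K = X X^T, which must recover the ordinary ridge solution; the data are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 3))
y = rng.normal(size=15)
lam = 0.5

K = X @ X.T                                        # linear-kernel Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(15), y)   # dual coefficients
w_kernel = X.T @ alpha                             # recovered weight vector

# Ordinary ridge regression for comparison.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
```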
Describe the pros and cons of using a higher-order kernel (e.g. radial basis kernel or polynomial kernel of high order) for classification or regression.
Define the multinomial and multivariate Gaussian distributions. Recognize Sigma in the Gaussian distribution as the covariance matrix.
Write down the resulting multivariate Gaussian distribution after applying a change of basis to a spherical Gaussian distribution.
Define the conditional independence of X and Y given Z.
Identify the following aspects of a graph: vertex, edge, undirected graph, directed graph, adjacent, child, parent, descendent, ancestor, path, cycle, directed path, undirected path, collider, non-collider.
Identify the following kinds of graphs: acyclic directed graph, complete graph, subgraph, clique, maximal clique.
Define the separation of sets A and B by C. Define the d-separation of sets A and B by C.
Define a graphical model as a collection of random variables with a graph defining the relationship between the variables. Recognize shaded and unshaded nodes as observed and hidden variables respectively.
Recognize directed edges as descriptions of conditional distributions. Recognize undirected edges as descriptions of joint distributions. Define Bayesian networks and Markov random fields.
Define a distribution satisfying the factorization property with respect to a graph using conditional distributions.
Describe ways to define multinomial distributions and Gaussian distributions which factorize with respect to a graph. Given a graph, write down the inverse covariance of the corresponding Gaussian using linear dependencies.
Define the global Markov property of a distribution with respect to a graph. Determine if a statement is true using d-separation.
A distribution satisfies the global Markov property with respect to a directed graph G if
whenever C d-separates A and B, A and B are conditionally independent given C, i.e. P(A, B | C) = P(A | C) P(B | C)
Understand the Hammersley-Clifford Theorem as stating the equivalence between two ways of defining Bayesian networks.
Understand the concept of explaining away, that conditioning on a collider will lead to dependence between two parent variables.
Define a distribution satisfying the factorization property with respect to a graph using potential functions. If the distribution is strictly positive, give the definition in terms of energy functions.
Define the global Markov property in terms of separation.
A distribution satisfies the global Markov property with respect to an undirected graph G if
whenever C separates A and B, A and B are conditionally independent given C, i.e. P(A, B | C) = P(A | C) P(B | C)
How did the class example satisfy this?
Define the pairwise Markov property in terms of non-adjacency.
Understand the Hammersley-Clifford Theorem as stating the equivalence between three ways of defining strictly positive Markov random fields.
If p is strictly positive, then the following are equivalent.
p satisfies the factorisation property with respect to G
p satisfies the global Markov property with respect to G
p satisfies the pairwise Markov property with respect to G
Describe ways to define multinomial distributions and Gaussian distributions which factorize with respect to a graph.
Given a graph, write down the inverse covariance of the corresponding Gaussian using the pairwise Markov property.
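A sketch for a Gaussian on the chain X1 - X2 - X3 built from linear dependencies (x2 = a*x1 + noise, x3 = b*x2 + noise, unit noise): the covariance is dense, but the precision (inverse covariance) entry for the non-adjacent pair (X1, X3) is zero, matching the pairwise Markov property; the coefficients are made up:

```python
import numpy as np

a, b = 0.7, -0.4
B = np.array([[0.0, 0.0, 0.0],
              [a,   0.0, 0.0],
              [0.0, b,   0.0]])   # linear dependency matrix (child <- parent)
I = np.eye(3)

# x = B x + e with e ~ N(0, I)  =>  x = (I - B)^{-1} e,
# so Cov = (I - B)^{-1} (I - B)^{-T}.
M = np.linalg.inv(I - B)
cov = M @ M.T
precision = np.linalg.inv(cov)   # equals (I - B)^T (I - B)
```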
Recognize a Boltzmann machine as a special case of a multinomial distribution with binary variables and only two-way interactions (second-order energies).